Outlier Detection for new set of samples for 6 populations (NAR and GHP excluded)

Documentation for read trimming, mapping, variant calling, and SNP filtering can be accessed here. Four outlier detection programs were used to identify outlier SNPs (putatively under selection). pcadapt, OutFLANK, and BayeScan identify outliers based on population structure. LFMM2 identifies outliers based on associations with environmental predictors.

In terminal:

In the NB_ddhaplo_working directory, make a new directory for outlier detection and copy in the filtered vcf file and popmap file

$ mkdir NB_OutlierDetection_working
$ cd NB_OutlierDetection_working 

$ cp ../NB_SNPFiltering_working/SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf.gz . 
$ cp ../NB_SNPFiltering_working/popmap . 

Unzipping gzipped vcf file

$ gzip -d SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf.gz

In R:

1. pcadapt

# install and load pcadapt
#install.packages("pcadapt")
library(pcadapt)
# Convert VCF file to pcadapt format

vcf2pcadapt("SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf", output = "tmp.pcadapt", allele.sep = c("/", "|"))

filename <- read.pcadapt("tmp.pcadapt", type = "pcadapt")
# Choosing the number K of Principal Components
# We start off with a large number of PCs
x6 <- pcadapt(input = filename, K = 20)
# Plot the proportion of explained variance for each PC (scree plot)
plot(x6, option = "screeplot")

# Plotting the scree plot again, restricted to the first 10 Principal Components
plot(x6, option = "screeplot", K = 10)

# Create population designations minus NAR, NAR2, GHP
poplist.names6 <- c(rep("BAR", 10),rep("BIS", 10),rep("GB", 10), rep("KIC", 10),rep("MCD", 10), rep("PVD", 10))
# Score Plot
# Plot the first two PCs
plot(x6, option = "scores", pop = poplist.names6)

# PCA plot with PCs 3 and 4
plot(x6, option = "scores", i = 3, j = 4, pop = poplist.names6)

# evaluating LD in dataset - looking to see if loadings are clustered in certain genomic regions (following https://bcm-uga.github.io/pcadapt/articles/pcadapt.html#f--linkage-disequilibrium-ld-thinning)
par(mfrow = c(2, 2))
for (i in 1:4)
  plot(x6$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))

The top-left panel shows that PC1 is determined by one (maybe two) genomic regions, likely still an inversion. To minimize outlier detection driven by LD, I can apply LD thinning. I'm going to use pcadapt's default values for window size and r² threshold (size = 500, thr = 0.1).

# performing SNP thinning in order to compute PCs
res6 <- pcadapt(filename, K = 10, LD.clumping = list(size = 500, thr = 0.1))
plot(res6, option = "screeplot")

Based on this scree plot, it's not clear where the elbow is (where the proportion of explained variance levels off). I'm going to try a few different PC values moving forward. After some exploratory testing, I am moving forward with K = 4.

# Score Plot
# Plot the first two PCs
plot(res6, option = "scores", pop = poplist.names6)

# PCA plot with PCs 3 and 4
plot(res6, option = "scores", i = 3, j = 4, pop = poplist.names6)

res6 <- pcadapt(filename, K = 4, LD.clumping = list(size = 500, thr = 0.1))
par(mfrow = c(1,1))
plot(res6$loadings[, 1], pch = 19, cex = .3, ylab = "Loadings PC1")

No evidence of the inversions shows up in the PC1 loadings.

plot(res6)

We still see some evidence of the inversions in the Manhattan plot; however, this may be due to stronger selective pressures in the inverted regions.

# evaluating LD in dataset - looking to see if loadings are still clustered in certain genomic regions (following https://bcm-uga.github.io/pcadapt/articles/pcadapt.html#f--linkage-disequilibrium-ld-thinning)
par(mfrow = c(2, 2))
for (i in 1:4)
  plot(res6$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))

res6$singular.values
## [1] 0.05504607 0.05443791 0.05423507 0.05373564
# Create Q-Q Plot
plot(res6, option = "qqplot", threshold = 0.1)

hist(res6$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")

# Look at the distribution of the test statistic
plot(res6, option = "stat.distribution")
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_bin()`).

Assigning an alpha value

library(qvalue) # transforms p-values into q-values.
qval6 <- qvalue(res6$pvalues)$qvalues
alpha6 <- 0.1 # expected false discovery rate of 10% - raised from 0.05 (5%)

Save Outliers

outliers6 <- which(qval6 < alpha6)
outliers6
##  [1]   154   171   172   173   174   751   947  1652  1662  2865  5329  6505
## [13]  8127  8206  8334  8474  8477  8482 10559 11546 13494 15497 16911 16963
## [25] 16967 16968 16987 16988 16990 16991 17042 17142 17439 17445 17616 17629
## [37] 17780 17784 17785 17789 17808 17809 18650 18663 18679 18709 18761 18808
## [49] 18812 18887 18890 18924 19011 19091 19094 19125 19130 19134

With K = 4 and LD clumping (window size = 500, squared correlation threshold = 0.1), 58 outlier SNPs were identified.

# Save outliers to .txt file (writeLines overwrites the file, so rerunning doesn't append duplicate lines)
writeLines(as.character(outliers6), "outliers_pcadapt_thinned500.txt")
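The saved values are row indices into the VCF, not locus names. A minimal sketch of translating them into chromosome_position labels, using toy stand-ins for the chromosome and position vectors (in this workflow, `getCHROM()`/`getPOS()` from vcfR, used in the OutFLANK section, would supply the real ones):

```r
# Sketch: translate pcadapt outlier indices into locus names.
# chrom/pos are toy stand-ins for getCHROM(my_vcf6)/getPOS(my_vcf6) from vcfR.
chrom <- c("NC_035789.1", "NC_035789.1", "NC_035780.1", "NC_035780.1")
pos   <- c(147427, 147498, 10067423, 1007505)
outlier_idx <- c(2, 3)  # hypothetical outlier row numbers
paste0(chrom[outlier_idx], "_", pos[outlier_idx])
# "NC_035789.1_147498" "NC_035780.1_10067423"
```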

2. OutFLANK

# Load all necessary packages
library(OutFLANK)  # outflank package
library(vcfR)
## This is vcfR 1.12.0
library(bigsnpr)   # package for Linkage Disequilibrium pruning
## Loading required package: bigstatsr
# Convert VCF to OutFLANK format
my_vcf6 <- read.vcfR("SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 78
##   header_line: 79
##   variant count: 22874
##   column count: 69
## All meta lines processed.
## Character matrix gt created: 22874 rows x 69 cols.
## All variants processed
geno6 <- extract.gt(my_vcf6) # Character matrix containing the genotypes
position6 <- getPOS(my_vcf6) # Positions in bp
chromosome6 <- getCHROM(my_vcf6) # Chromosome information

G6 <- matrix(NA, nrow = nrow(geno6), ncol = ncol(geno6))

G6[geno6 %in% c("0/0", "0|0")] <- 0
G6[geno6  %in% c("0/1", "1/0", "1|0", "0|1")] <- 1
G6[geno6 %in% c("1/1", "1|1")] <- 2

# NA should be replaced with “9” to work with the functions in the OutFLANK package
G6[is.na(G6)] <- 9

head(G6[,1:10])
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    0    0    0    0     0
## [2,]    0    0    1    0    0    0    0    0    0     0
## [3,]    0    0    1    0    0    0    0    0    0     0
## [4,]    0    0    2    0    0    0    0    0    0     0
## [5,]    0    0    2    0    0    0    0    0    0     0
## [6,]    0    0    0    0    1    0    1    0    0     0
# Convert pop designations to a vector
pop6 <- as.vector(poplist.names6)
pop6
##  [1] "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BIS" "BIS"
## [13] "BIS" "BIS" "BIS" "BIS" "BIS" "BIS" "BIS" "BIS" "GB"  "GB"  "GB"  "GB" 
## [25] "GB"  "GB"  "GB"  "GB"  "GB"  "GB"  "KIC" "KIC" "KIC" "KIC" "KIC" "KIC"
## [37] "KIC" "KIC" "KIC" "KIC" "MCD" "MCD" "MCD" "MCD" "MCD" "MCD" "MCD" "MCD"
## [49] "MCD" "MCD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD"
# Calculate Fst
my_fst6 <- MakeDiploidFSTMat(t(G6), locusNames = paste0(chromosome6,"_", position6), popNames = pop6)
## Calculating FSTs, may take a few minutes...
## [1] "10000 done of 22874"
## [1] "20000 done of 22874"
head(my_fst6)
##            LocusName        He         FST           T1         T2   FSTNoCorr
## 1 NC_035789.1_147427 0.1116059  0.05531086  0.003146064 0.05687967 0.113516624
## 2 NC_035789.1_147498 0.1098611 -0.04377104 -0.002407407 0.05500000 0.007575758
## 3 NC_035789.1_147507 0.1244444 -0.03980986 -0.002481481 0.06233333 0.010695187
## 4 NC_035789.1_147556 0.1551278 -0.05293398 -0.004112412 0.07768946 0.009789265
## 5 NC_035789.1_147576 0.1409078 -0.03605094 -0.002551254 0.07076804 0.026704323
## 6 NC_035789.1_147582 0.1116059 -0.04495529 -0.002511675 0.05587052 0.007337601
##       T1NoCorr   T2NoCorr meanAlleleFreq
## 1 0.0064568966 0.05688063      0.9406780
## 2 0.0004166667 0.05500000      0.9416667
## 3 0.0006666667 0.06233333      0.9333333
## 4 0.0007605364 0.07769086      0.9152542
## 5 0.0018898467 0.07076932      0.9237288
## 6 0.0004099617 0.05587136      0.9406780
# the output will be a table showing Fst statistics for each locus
# Plot heterozygosity vs Fst
plot(my_fst6$He, my_fst6$FST)

# Plot Fst vs FstNoCorr
plot(my_fst6$FST, my_fst6$FSTNoCorr)
abline(0,1)

We need to give OutFLANK a set of quasi-independent SNPs to estimate the neutral FST distribution.

# Load additional packages
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("stringr")
# Chromosomes need to be of class integer for this to work.
# Visualizing chromosomes
chrom_unique6 <- unique(chromosome6)
print(chrom_unique6)
##  [1] "NC_035789.1" "NC_035780.1" "NC_035781.1" "NC_035782.1" "NC_035783.1"
##  [6] "NC_035784.1" "NC_035785.1" "NC_035786.1" "NC_035787.1" "NC_035788.1"
# Removing "NC_" from the chromosome name so it can be converted to an integer
chrom_new6 <- chromosome6 %>% str_replace("NC_", "")

# Converting the character string into an integer
chrom6 <- as.integer(chrom_new6)
# Chromosome and position need to be sorted for the pruning step to work.
# Sort them jointly with order() so each position stays paired with its chromosome
# (sorting the two vectors independently would scramble the pairing);
# note the genotype matrix must follow the same order for the SNP info to line up.
ord6 <- order(chrom6, position6)
chrom_sort6 <- chrom6[ord6]
pos_sort6 <- position6[ord6]

Note: This pruning step (snp_autoSVD) does not allow for missing genotype values.

Using snp_fastImputeSimple to impute missing genotypes

# Converting the genotype matrix to class FBM.code256
G1_6 <- add_code256(big_copy(t(G6),type="raw"),code=bigsnpr:::CODE_012)

# Imputing missing genotypes
G_missSimple6 <- snp_fastImputeSimple(G1_6, method = "mode")

The imputation method can be "random" (sampling according to allele frequencies), "mean0" (rounded mean), "mean2" (mean rounded to 2 decimal places), or "mode" (most frequent call). "Random" makes the most sense to me because imputation is then based on allele frequencies (more likely to fill in potential outliers) rather than a generalized mean or most frequent call. I tested all four methods and 0 outliers were identified with each, so the choice doesn't make a difference in this data set.
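As a toy illustration of what "mode" imputation does (base R only; this is not bigsnpr's implementation, just the idea), each SNP's missing calls are filled with its most frequent observed genotype:

```r
# Toy illustration of "mode" imputation: fill missing calls (coded 9)
# with the most frequent observed genotype at that SNP
impute_mode <- function(g) {
  obs <- g[g != 9]
  g[g == 9] <- as.integer(names(which.max(table(obs))))
  g
}
snp <- c(0, 0, 1, 9, 0, 2, 9)  # hypothetical genotypes for one SNP
impute_mode(snp)               # the 9s become 0, the most frequent call
```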

newpc6 <- snp_autoSVD(G_missSimple6, infos.chr = chrom_sort6, infos.pos = pos_sort6)
## 
## Phase of clumping (on MAF) at r^2 > 0.2.. keep 12015 SNPs.
## Discarding 3137 variants with MAC < 10.
## 
## Iteration 1:
## Computing SVD..
## 37 outlier variants detected..
## 1 long-range LD region detected..
## 
## Iteration 2:
## Computing SVD..
## 54 outlier variants detected..
## 2 long-range LD regions detected..
## 
## Iteration 3:
## Computing SVD..
## 0 outlier variant detected..
## 
## Converged!
which_pruned6 <- attr(newpc6, which="subset") # Indexes of remaining SNPS after pruning
length(which_pruned6)
## [1] 8787
head(which_pruned6)
## [1]  4  7 10 11 12 19

Run the OutFLANK() function to estimate the parameters of the neutral FST distribution.

out_trim6 <- OutFLANK(my_fst6[which_pruned6,], NumberOfSamples=60, qthreshold = 0.05, Hmin = 0.1)
str(out_trim6)
## List of 6
##  $ FSTbar               : num 0.000955
##  $ FSTNoCorrbar         : num 0.0486
##  $ dfInferred           : num 5.33
##  $ numberLowFstOutliers : int 0
##  $ numberHighFstOutliers: int 0
##  $ results              :'data.frame':   8787 obs. of  15 variables:
##   ..$ LocusName       : Factor w/ 22874 levels "NC_035780.1_10067423",..: 22591 22625 22628 22640 22641 22744 22757 22758 22783 22786 ...
##   ..$ He              : num [1:8787] 0.155 0.428 0.274 0.479 0.155 ...
##   ..$ FST             : num [1:8787] -0.05293 -0.00637 0.01919 -0.01497 0.01537 ...
##   ..$ T1              : num [1:8787] -0.00411 -0.00137 0.00266 -0.0036 0.00121 ...
##   ..$ T2              : num [1:8787] 0.0777 0.2156 0.1385 0.2405 0.0785 ...
##   ..$ FSTNoCorr       : num [1:8787] 0.00979 0.04399 0.06594 0.03073 0.0716 ...
##   ..$ T1NoCorr        : num [1:8787] 0.000761 0.009485 0.009132 0.007392 0.005621 ...
##   ..$ T2NoCorr        : num [1:8787] 0.0777 0.2156 0.1385 0.2405 0.0785 ...
##   ..$ meanAlleleFreq  : num [1:8787] 0.915 0.69 0.164 0.603 0.915 ...
##   ..$ indexOrder      : int [1:8787] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ GoodH           : Factor w/ 1 level "goodH": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ qvalues         : num [1:8787] 0.97 0.934 0.914 0.959 0.914 ...
##   ..$ pvalues         : num [1:8787] 0.0648 0.9616 0.4676 0.6297 0.3809 ...
##   ..$ pvaluesRightTail: num [1:8787] 0.968 0.481 0.234 0.685 0.19 ...
##   ..$ OutlierFlag     : logi [1:8787] FALSE FALSE FALSE FALSE FALSE FALSE ...
# Check the fit and make sure it looks good, especially in the right tail:
OutFLANKResultsPlotter(out_trim6, withOutliers = TRUE, NoCorr = TRUE, Hmin = 0.1,binwidth = 0.001, Zoom = FALSE, RightZoomFraction = 0.05, titletext = NULL)

# Check the p-value histogram
hist(out_trim6$results$pvaluesRightTail)

# Using estimated neutral mean FST and df to calculate P-values for all loci
P1_6 <- pOutlierFinderChiSqNoCorr(my_fst6, Fstbar = out_trim6$FSTNoCorrbar, dfInferred = out_trim6$dfInferred, qthreshold = 0.05, Hmin=0.1)
# Identify outlier SNPs
my_out6 <- P1_6$OutlierFlag==TRUE
plot(P1_6$He, P1_6$FST, pch=19, col=rgb(0,0,0,0.1))
points(P1_6$He[my_out6], P1_6$FST[my_out6], col="blue")

P1_6[which(P1_6$OutlierFlag == TRUE),]
##  [1] LocusName        He               FST              T1              
##  [5] T2               FSTNoCorr        T1NoCorr         T2NoCorr        
##  [9] meanAlleleFreq   pvalues          pvaluesRightTail qvalues         
## [13] OutlierFlag     
## <0 rows> (or 0-length row.names)
outliers_out6 <- P1_6[which(P1_6$OutlierFlag == TRUE),]
outliers_list6 <- outliers_out6$LocusName
outliers_list6
## factor(0)
## 22874 Levels: NC_035780.1_10067423 NC_035780.1_1007505 ... NC_035789.1_9490358

No outliers detected

3. BayeScan

In terminal:

Convert VCF file to other outputs

# download the SPID configuration file that tells PGDSpider how to convert VCF to BayeScan format
$ curl -L -O https://raw.githubusercontent.com/amyzyck/EecSeq_NB_EasternOyster/master/Scripts/BSsnp.spid
$ chmod +x BSsnp.spid

$ java -jar /usr/local/bin/PGDSpider2-cli.jar -spid BSsnp.spid -inputfile SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf -outputfile SNP6p9g9.BS

output:
WARN  12:07:05 - PGDSpider configuration file not found! Loading default configuration.
initialize convert process...
read input file...
read input file done.
write output file...
write output file done.

Run BayeScan

$ BayeScan2.1_linux64bits SNP6p9g9.BS -thin 50 -nbp 30

In R:

Plotting outliers

source("BayeScan/plot_R.r")
plot_bayescan("BayeScan/SNP6p9g9_fst.txt")

## $outliers
## integer(0)
## 
## $nb_outliers
## [1] 0
# Identifying outliers using a False discovery rate of 10%.
bs6p9g9 <- plot_bayescan("BayeScan/SNP6p9g9_fst.txt", FDR = 0.10)

bs6p9g9$outliers
## integer(0)

No outliers detected

4. LFMM2

Following steps documented by pgugger here to prep the genomic and environmental data.

Loading the required libraries

#if (!require("BiocManager", quietly = TRUE))
#install.packages("BiocManager")

#BiocManager::install("LEA")
library(LEA)

For LFMM, the input genomic data needs to be SNPs as genotypes encoded 0, 1, or 2 for homozygous, heterozygous, and other homozygous, respectively. Missing data is coded as “9”.

In terminal:

# Converting vcf to genotyped file for input
$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf --012 --out snp6p9g9

# Replacing missing data with 9 and creating the .lfmm file
$ sed 's/-1/9/g' snp6p9g9.012 | cut -f2- > snp6p9g9.lfmm

I also have a vcf file with SNPs inside known inversions removed. I want to run LFMM2 on this data set too.

$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A_noinvert.recode.vcf --012 --out snp6p9g9_noinvert

$ sed 's/-1/9/g' snp6p9g9_noinvert.012 | cut -f2- > snp6p9g9_noinvert.lfmm
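An optional sanity check that the resulting .lfmm file contains only the expected codes (0, 1, 2, 9). Shown here on a toy file so the command is self-contained; the same awk line works on the real snp6p9g9.lfmm:

```shell
# Toy stand-in for snp6p9g9.lfmm
printf '0 1 2 9\n2 0 1 0\n' > toy.lfmm
# Print the count of tokens that are not 0, 1, 2, or 9 (expect 0)
awk '{for (i = 1; i <= NF; i++) if ($i !~ /^[0129]$/) n++} END {print n + 0}' toy.lfmm
```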
In R:

# run snmf to estimate K, considering K from 1-10:
project_6p9g9 = NULL
project_6p9g9 = snmf("LFMM2/snp6p9g9.lfmm", K = 1:10, entropy = TRUE, repetitions = 10, project = "new")
#pdf("sNMF_6p9g9.pdf")
plot(project_6p9g9, col = "blue", pch = 19, cex = 1.2)

#dev.off()

Based on the plot of cross-entropy, K = 1 has the lowest cross-entropy value.

# Generating a Structure-type plot. K = 1 is not a very interesting plot to look at, so I also generated plots with K = 2-6. With K = 6, the individuals are spread across the 6 ancestral populations.
best_6p9g9 = which.min(cross.entropy(project_6p9g9, K = 6))
# pdf("sNMF.barchart_6p9g9.pdf")
barchart(project_6p9g9, K = 6, run = best_6p9g9, border = NA, space = 0, col = c("red", "blue","green","yellow","orange","pink"), xlab = "Individuals", ylab = "Ancestry proportions") -> bp_6p9g9
axis(1, at = 1:length(bp_6p9g9$order), labels = bp_6p9g9$order, las=1, cex.axis = .3)

#dev.off()
#imputing any missing data
project.missing_6p9g9 = snmf("LFMM2/snp6p9g9.lfmm", K = 1,
        entropy = TRUE, repetitions = 10,
        project = "new")
# select the run with the lowest cross-entropy value
best_6p9g9 = which.min(cross.entropy(project.missing_6p9g9, K = 1))
# Impute the missing genotypes
impute(project.missing_6p9g9, "LFMM2/snp6p9g9.lfmm",
       method = 'random', K = 1, run = best_6p9g9)
## Missing genotype imputation for K = 1 
## Missing genotype imputation for run = 9 
## Results are written in the file:  LFMM2/snp6p9g9.lfmm_imputed.lfmm
# Proportion of correct imputation results can't be computed here: the true
# genotypes at the missing sites are unknown (pgugger's tutorial compares
# against a complete reference matrix)
dat.imp_6p9g9 = read.lfmm("LFMM2/snp6p9g9.lfmm_imputed.lfmm")
#mean( tutorial.R[dat == 9] == dat.imp[dat == 9] )

Repeating with no invert dataset

# run snmf to estimate K, considering K from 1-10:
project_noinvert_6p9g9 = NULL
project_noinvert_6p9g9 = snmf("LFMM2/snp6p9g9_noinvert.lfmm", K = 1:10, entropy = TRUE, repetitions = 10, project = "new")
#pdf("sNMF_noinvert_6p9g9.pdf")
plot(project_noinvert_6p9g9, col = "blue", pch = 19, cex = 1.2)

#dev.off()

Based on the plot of cross-entropy, K = 1 has the lowest cross-entropy value.

# Generating a Structure-type plot. K = 1 is not a very interesting plot to look at, so I also generated plots with K = 2-6. With K = 6, the individuals are spread across the 6 ancestral populations.
best_noinvert_6p9g9 = which.min(cross.entropy(project_noinvert_6p9g9, K = 6))
# pdf("sNMF.barchart_noinvert_6p9g9.pdf")
barchart(project_noinvert_6p9g9, K = 6, run = best_noinvert_6p9g9, border = NA, space = 0, col = c("red", "blue","green","yellow","orange","pink"), xlab = "Individuals", ylab = "Ancestry proportions") -> bp_noinvert_6p9g9
axis(1, at = 1:length(bp_noinvert_6p9g9$order), labels = bp_noinvert_6p9g9$order, las=1, cex.axis = .3)

#dev.off()
#imputing any missing data
project.missing_noinvert_6p9g9 = snmf("LFMM2/snp6p9g9_noinvert.lfmm", K = 1,
        entropy = TRUE, repetitions = 10,
        project = "new")
# select the run with the lowest cross-entropy value
best_noinvert_6p9g9 = which.min(cross.entropy(project.missing_noinvert_6p9g9, K = 1))
# Impute the missing genotypes
impute(project.missing_noinvert_6p9g9, "LFMM2/snp6p9g9_noinvert.lfmm",
       method = 'mode', K = 1, run = best_noinvert_6p9g9)
## Missing genotype imputation for K = 1 
## Missing genotype imputation for run = 5 
## Results are written in the file:  LFMM2/snp6p9g9_noinvert.lfmm_imputed.lfmm
# Proportion of correct imputation results can't be computed here: the true
# genotypes at the missing sites are unknown
dat.imp_noinvert_6p9g9 = read.lfmm("LFMM2/snp6p9g9_noinvert.lfmm_imputed.lfmm")
#mean( tutorial.R[dat == 9] == dat.imp[dat == 9] )

Prepping the environmental data

clim.env_6pops_pca_red <- read.table("./env_data_pca_reduced", header=TRUE)

clim.env_6pops_pca_red <- clim.env_6pops_pca_red[,5:17]

colnames(clim.env_6pops_pca_red) <- NULL
row.names(clim.env_6pops_pca_red) <- NULL

clim.env_6pops_pca_red
## (one distinct row per population; each repeats for that population's 10 individuals)
## BAR (rows 1-10):  17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277  1.998418 -0.8784983
## BIS (rows 11-20):  8.824636  9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09  1.8262205 -2.447613  3.3327135
## GB  (rows 21-30): 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343  3.079209 -0.6975026
## KIC (rows 31-40): 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056 -1.122021  2.4790058
## MCD (rows 41-50): 12.111228  6.49 25.84 33.94 21.57 26.39 21.71  9.35 7.69 7.53  7.8111068  2.375912 -0.2570490
## PVD (rows 51-60): 59.860038  8.78 29.10 27.99 19.73 24.45 18.66  5.01 7.63 6.95  0.3111403 -3.883906 -3.9786694
write.table(clim.env_6pops_pca_red, "/home/azyck/NB_capture_both/NB_ddhaplo/NB_ddhaplo_working/NB_OutlierDetection_working/clim.env_6pops_pca_red.env", sep="\t", quote=F, row.names=F)
# Prepping a second env file for downstream processing of LFMM output
clim.env_6pops_pca_red_full <- read.table("./env_data_pca_reduced", header=TRUE)

env_6pops_pca_red <- clim.env_6pops_pca_red_full[,5:17]

LFMM ridge

I am following steps documented by B.R. Forester here for running the LFMM ridge model.

# if(!requireNamespace("qvalue", quietly = TRUE)) {  
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install(version = "3.14")
# BiocManager::install("qvalue")
# }
# if(!requireNamespace("lfmm", quietly = TRUE)) {  
#  remotes::install_github("bcm-uga/lfmm")
# }
library(vegan)    # Used to run PCA & RDA
## Loading required package: permute
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:LEA':
## 
##     barchart
## This is vegan 2.5-7
library(lfmm)     # Used to run LFMM
library(qvalue)   # Used to post-process LFMM output
library(vcfR)
library(stringr)

Import the genetic data

# to make it tab separated - sed 's/ \+/\t/g' snp6p9g9.lfmm_imputed.lfmm  > snp6p9g9.lfmm_imputed_tab.lfmm
gen6p9g9<-read.delim("LFMM2/snp6p9g9.lfmm_imputed_tab.lfmm",header = FALSE)
row.names(gen6p9g9) <- c("BAR_10","BAR_1","BAR_2","BAR_3","BAR_4","BAR_5","BAR_6","BAR_7","BAR_8","BAR_9","BIS_10","BIS_1","BIS_2","BIS_3","BIS_4","BIS_5","BIS_6","BIS_7","BIS_8","BIS_9","GB_10","GB_1","GB_2","GB_3","GB_4","GB_5","GB_6","GB_7","GB_8","GB_9","KIC_10","KIC_1","KIC_2","KIC_3","KIC_4","KIC_5","KIC_6","KIC_7","KIC_8","KIC_9","MCD_10","MCD_1","MCD_2","MCD_3","MCD_4","MCD_5","MCD_6","MCD_7","MCD_8","MCD_9","PVD_10","PVD_1","PVD_2","PVD_3","PVD_4","PVD_5","PVD_6","PVD_7","PVD_8","PVD_9") # Adding individual names to rows of matrix
dim(gen6p9g9)
## [1]    60 22874
# LFMM requires a complete dataframe, so I am using the snp6p9g9.lfmm_imputed_tab.lfmm file created above, in which missing genotypes were imputed. I will move forward with this and see what happens
# No NAs 
sum(is.na(gen6p9g9))
## [1] 0

Repeat for no_invert dataset

# to make it tab separated - sed 's/ \+/\t/g' snp6p9g9_noinvert.lfmm_imputed.lfmm  > snp6p9g9_noinvert.lfmm_imputed_tab.lfmm
gen_noinvert_6p9g9<-read.delim("LFMM2/snp6p9g9_noinvert.lfmm_imputed_tab.lfmm",header = FALSE)
row.names(gen_noinvert_6p9g9) <- c("BAR_10","BAR_1","BAR_2","BAR_3","BAR_4","BAR_5","BAR_6","BAR_7","BAR_8","BAR_9","BIS_10","BIS_1","BIS_2","BIS_3","BIS_4","BIS_5","BIS_6","BIS_7","BIS_8","BIS_9","GB_10","GB_1","GB_2","GB_3","GB_4","GB_5","GB_6","GB_7","GB_8","GB_9","KIC_10","KIC_1","KIC_2","KIC_3","KIC_4","KIC_5","KIC_6","KIC_7","KIC_8","KIC_9","MCD_10","MCD_1","MCD_2","MCD_3","MCD_4","MCD_5","MCD_6","MCD_7","MCD_8","MCD_9","PVD_10","PVD_1","PVD_2","PVD_3","PVD_4","PVD_5","PVD_6","PVD_7","PVD_8","PVD_9") # Adding individual names to rows of matrix
dim(gen_noinvert_6p9g9)
## [1]    60 21070
# LFMM requires a complete dataframe, so I am using the snp6p9g9_noinvert.lfmm_imputed_tab.lfmm file created above, in which missing genotypes were imputed. I will move forward with this and see what happens
# No NAs 
sum(is.na(gen_noinvert_6p9g9))
## [1] 0
env6p9g9_red.lfmm <- clim.env_6pops_pca_red_full
env6p9g9_red.lfmm$Individual <- c("BAR_10","BAR_1","BAR_2","BAR_3","BAR_4","BAR_5","BAR_6","BAR_7","BAR_8","BAR_9","BIS_10","BIS_1","BIS_2","BIS_3","BIS_4","BIS_5","BIS_6","BIS_7","BIS_8","BIS_9","GB_10","GB_1","GB_2","GB_3","GB_4","GB_5","GB_6","GB_7","GB_8","GB_9","KIC_10","KIC_1","KIC_2","KIC_3","KIC_4","KIC_5","KIC_6","KIC_7","KIC_8","KIC_9","MCD_10","MCD_1","MCD_2","MCD_3","MCD_4","MCD_5","MCD_6","MCD_7","MCD_8","MCD_9","PVD_10","PVD_1","PVD_2","PVD_3","PVD_4","PVD_5","PVD_6","PVD_7","PVD_8","PVD_9")
str(env6p9g9_red.lfmm)
## 'data.frame':    60 obs. of  17 variables:
##  $ Individual       : chr  "BAR_10" "BAR_1" "BAR_2" "BAR_3" ...
##  $ Population       : Factor w/ 6 levels "BAR","BIS","GB",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Latitude         : num  41.7 41.7 41.7 41.7 41.7 ...
##  $ Longitude        : num  -71.3 -71.3 -71.3 -71.3 -71.3 ...
##  $ SewageEffluent   : num  17.9 17.9 17.9 17.9 17.9 ...
##  $ Temp_Min         : num  12.2 12.2 12.2 12.2 12.2 12.2 12.2 12.2 12.2 12.2 ...
##  $ June_Salinity_Max: num  30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 ...
##  $ Temp_Max         : num  28.5 28.5 28.5 28.5 28.5 28.5 28.5 28.5 28.5 28.5 ...
##  $ July_Temp_Min    : num  21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1 ...
##  $ July_Temp_Avg    : num  24.4 24.4 24.4 24.4 24.4 ...
##  $ June_Salinity_Avg: num  27.9 27.9 27.9 27.9 27.9 ...
##  $ June_Salinity_Min: num  25 25 25 25 25 25 25 25 25 25 ...
##  $ pH_Avg           : num  7.69 7.69 7.69 7.69 7.69 7.69 7.69 7.69 7.69 7.69 ...
##  $ June_DO_Avg      : num  5.71 5.71 5.71 5.71 5.71 5.71 5.71 5.71 5.71 5.71 ...
##  $ PC1              : num  -3.65 -3.65 -3.65 -3.65 -3.65 ...
##  $ PC2              : num  2 2 2 2 2 ...
##  $ PC3              : num  -0.878 -0.878 -0.878 -0.878 -0.878 ...
#subsetting the environmental dataset to just include the environmental variables
pred6p9g9_red <- env_6pops_pca_red
pred6p9g9_red_noPCs <- env_6pops_pca_red[,1:10]
# Running a PCA on the environmental variables. We can use the first PC as a synthetic predictor
pred6p9g9_red.pca <- rda(pred6p9g9_red_noPCs, scale=T)
summary(pred6p9g9_red.pca)$cont
## $importance
## Importance of components:
##                          PC1    PC2    PC3    PC4      PC5
## Eigenvalue            5.1384 2.3949 1.3488 1.0301 0.087742
## Proportion Explained  0.5138 0.2395 0.1349 0.1030 0.008774
## Cumulative Proportion 0.5138 0.7533 0.8882 0.9912 1.000000
#plotting PCA 
plot(pred6p9g9_red.pca)

screeplot(pred6p9g9_red.pca, main = "Screeplot: Eigenvalues of Oyster Predictor Variables")

## correlations between the PC axis and predictors:
round(scores(pred6p9g9_red.pca, choices=1:5, display="species", scaling=0), digits=3)
##                      PC1    PC2    PC3    PC4    PC5
## SewageEffluent     0.111 -0.429  0.293 -0.591  0.485
## Temp_Min           0.438  0.063 -0.046 -0.018 -0.187
## June_Salinity_Max  0.414 -0.063  0.074  0.312  0.061
## Temp_Max          -0.373  0.322 -0.088  0.159  0.054
## July_Temp_Min     -0.041  0.454 -0.040 -0.685 -0.389
## July_Temp_Avg     -0.363  0.149 -0.419 -0.070  0.572
## June_Salinity_Avg  0.317  0.434 -0.055  0.129  0.366
## June_Salinity_Min  0.364  0.328 -0.167 -0.151  0.109
## pH_Avg             0.039  0.414  0.651  0.049  0.282
## June_DO_Avg       -0.342  0.100  0.516  0.109 -0.140
## attr(,"const")
## [1] 4.92848

PC1 explains 51% of the variation, PC2 explains 24%, and PC3 explains 13%. The strongest correlations with PC1 are Temp_Min, June_Salinity_Max, and Temp_Max.

I could store the synthetic PC axis predictor as pred.PC1, and do the same for PC2 and PC3. There's an option to run LFMM using these PC scores as the predictors, or to run LFMM on each individual environmental predictor. I'm opting for the latter, as I'd like to see which predictors are associated with outlier SNPs. I've also already saved the PC 1-3 scores for the full environmental dataset. The next section will be very repetitive, as I'm going to repeat the same set of code for each predictor.

# Save each predictor as its own variable

# Sewage Effluent 
SE <- pred6p9g9_red$SewageEffluent
# Min Temp
Min_T <- pred6p9g9_red$Temp_Min
# Max June Salinity
Max_Sal_Jun <- pred6p9g9_red$June_Salinity_Max
# Max Temp
Max_T <- pred6p9g9_red$Temp_Max
# July Temp Min
Min_T_Jul <- pred6p9g9_red$July_Temp_Min
# Avg Temp July
Avg_T_Jul <- pred6p9g9_red$July_Temp_Avg
# Avg Sal Jun
Avg_Sal_Jun <- pred6p9g9_red$June_Salinity_Avg
# Min June Salinity
Min_Sal_Jun <- pred6p9g9_red$June_Salinity_Min
# Avg pH
Avg_pH <- pred6p9g9_red$pH_Avg
# DO Avg June
Avg_DO_Jun <- pred6p9g9_red$June_DO_Avg
# PC1
PC1 <- pred6p9g9_red$PC1
# PC2
PC2 <- pred6p9g9_red$PC2
# PC3
PC3 <- pred6p9g9_red$PC3
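Each per-predictor section below repeats the same four steps: ridge LFMM fit, GIF-calibrated test, conversion to q-values, and an FDR filter. As a minimal sketch, those steps could be wrapped in a helper function; the function name and arguments here are illustrative only, and it assumes the lfmm and qvalue packages are loaded and the imputed genotype matrix (e.g. gen6p9g9) and predictor vectors defined above are in memory:

```r
# Illustrative helper wrapping the repeated per-predictor LFMM steps.
# Returns the GIF and the indices of SNPs passing the FDR threshold.
run_lfmm_predictor <- function(gen, predictor, K = 1, fdr = 0.1) {
  fit <- lfmm_ridge(Y = gen, X = predictor, K = K)           # ridge LFMM fit
  pv  <- lfmm_test(Y = gen, X = predictor, lfmm = fit,
                   calibrate = "gif")                        # GIF-calibrated tests
  qv  <- qvalue(pv$calibrated.pvalue)$qvalues                # p-values -> q-values
  list(gif = pv$gif, outliers = which(qv < fdr))             # FDR filter
}
# e.g. run_lfmm_predictor(gen6p9g9, SE)$outliers
```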

Determine K (estimate of number of populations in the data)

#Using broken stick criterion to determine k - PCs should be retained as long as observed eigenvalues are higher than corresponding random broken stick components 

#Testing it out on the environmental data
screeplot(pred6p9g9_red.pca, main = "Screeplot of Oyster Predictor Variables with Broken Stick", bstick=TRUE, type="barplot")

PC1 explains more than the random broken stick components, while PC2 + do not. If this were genomic data, and we were determining a value of K using this approach, we’d set K = 1.
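The same comparison can be read off numerically rather than from the plot. This is a quick check using the environmental PCA object defined above; it assumes the unconstrained rda() object stores its eigenvalues in $CA$eig, and uses vegan's bstick() for the broken stick null values:

```r
# A PC is retained while its eigenvalue exceeds the broken stick value.
eig <- pred6p9g9_red.pca$CA$eig    # eigenvalues from the rda() object
bs  <- bstick(pred6p9g9_red.pca)   # broken stick null values (vegan)
names(which(eig > bs))             # PCs exceeding the null expectation
```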

# Now looking at the genetic data
gen6p9g9.pca <- rda(gen6p9g9, scale=T)
screeplot(gen6p9g9.pca, main = "Screeplot of Genetic Data with Broken Stick", bstick=TRUE, type="barplot")

For the genomic data, we can see that none of the PCs have eigenvalues greater than random (greater than the broken stick values in red). This effectively means that K = 1 for the genomic data set, based on the PCA assessment. I also tried K = 3 based on the scree plot method; a higher K increased the GIF value for all predictors. Using K = 1 identified one additional outlier SNP for pH_Avg and PC2, so I will move forward with K = 1 for the LFMM2 analysis.
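The K comparison described above can be scripted rather than run by hand. This is a sketch only: it refits the model once per candidate K for a single predictor (here Min_T, defined above) and reports the resulting GIF, where values near 1 indicate well-calibrated tests:

```r
# GIF as a function of K for one predictor, using objects defined above.
# Note: each iteration refits the ridge LFMM, so this can be slow.
sapply(1:3, function(k) {
  fit <- lfmm_ridge(Y = gen6p9g9, X = Min_T, K = k)
  lfmm_test(Y = gen6p9g9, X = Min_T, lfmm = fit, calibrate = "gif")$gif
})
```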

Repeating for no invert dataset

# Now looking at the genetic data
gen_noinvert_6p9g9.pca <- rda(gen_noinvert_6p9g9, scale=T)
screeplot(gen_noinvert_6p9g9.pca, main = "Screeplot of Genetic Data with Broken Stick", bstick=TRUE, type="barplot")

For the genomic data, we can see that none of the PCs have eigenvalues greater than random (greater than the broken stick values in red). This effectively means that K=1 for the genomic data set, based on a PCA assessment.

K <- 1

Run LFMM

Sewage Effluent

oys6p9g9_red_SE.lfmm <- lfmm_ridge(Y=gen6p9g9, X=SE, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_SE.pv <- lfmm_test(Y=gen6p9g9, X=SE, lfmm=oys6p9g9_red_SE.lfmm, calibrate="gif")

names(oys6p9g9_red_SE.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_SE.pv$gif
## [1] 1.078058

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_SE <- oys6p9g9_red_SE.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_SE <- oys6p9g9_red_SE.pv$gif[1])       ## default GIF for this predictor
## [1] 1.078058
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_SE <- pchisq(zscore_6p9g9_red_SE^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")

hist(adj.pv6p9g9_red_SE, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_SE.qv <- qvalue(oys6p9g9_red_SE.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_SE.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys6p9g9_red_SE.FDR.1 <- which(oys6p9g9_red_SE.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_SE.FDR.1
## integer(0)

Temp Min

oys6p9g9_red_Min_T.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Min_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Min_T.pv <- lfmm_test(Y=gen6p9g9, X=Min_T, lfmm=oys6p9g9_red_Min_T.lfmm, calibrate="gif")

names(oys6p9g9_red_Min_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys6p9g9_red_Min_T.pv$gif
## [1] 1.071593

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Min_T <- oys6p9g9_red_Min_T.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Min_T <- oys6p9g9_red_Min_T.pv$gif[1])       ## default GIF for this predictor
## [1] 1.071593
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Min_T <- pchisq(zscore_6p9g9_red_Min_T^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")

hist(adj.pv6p9g9_red_Min_T, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Min_T.qv <- qvalue(oys6p9g9_red_Min_T.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Min_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Min_T.FDR.1 <- which(oys6p9g9_red_Min_T.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Min_T.FDR.1
## [1] 12105
invisible(lapply(oys6p9g9_red_Min_T.FDR.1, write, "outliers_lfmm2_6p9g9_red_Min_T.txt", append=TRUE))

June Salinity Max

oys6p9g9_red_Max_Sal_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Max_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Max_Sal_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Max_Sal_Jun, lfmm=oys6p9g9_red_Max_Sal_Jun.lfmm, calibrate="gif")

names(oys6p9g9_red_Max_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.11.
oys6p9g9_red_Max_Sal_Jun.pv$gif
## [1] 1.110477

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Max_Sal_Jun <- oys6p9g9_red_Max_Sal_Jun.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Max_Sal_Jun <- oys6p9g9_red_Max_Sal_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.110477
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Max_Sal_Jun <- pchisq(zscore_6p9g9_red_Max_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.11)")

hist(adj.pv6p9g9_red_Max_Sal_Jun, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Max_Sal_Jun.qv <- qvalue(oys6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Max_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_Max_Sal_Jun.FDR.1 <- which(oys6p9g9_red_Max_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Max_Sal_Jun.FDR.1
## [1]  1364 12105
invisible(lapply(oys6p9g9_red_Max_Sal_Jun.FDR.1, write, "outliers_lfmm2_6p9g9_red_Max_Sal_Jun.txt", append=TRUE))

Temp Max

oys6p9g9_red_Max_T.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Max_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Max_T.pv <- lfmm_test(Y=gen6p9g9, X=Max_T, lfmm=oys6p9g9_red_Max_T.lfmm, calibrate="gif")

names(oys6p9g9_red_Max_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_Max_T.pv$gif
## [1] 1.082925

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Max_T <- oys6p9g9_red_Max_T.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Max_T <- oys6p9g9_red_Max_T.pv$gif[1])       ## default GIF for this predictor
## [1] 1.082925
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Max_T <- pchisq(zscore_6p9g9_red_Max_T^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")

hist(adj.pv6p9g9_red_Max_T, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Max_T.qv <- qvalue(oys6p9g9_red_Max_T.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Max_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Max_T.FDR.1 <- which(oys6p9g9_red_Max_T.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Max_T.FDR.1
## [1] 20651
invisible(lapply(oys6p9g9_red_Max_T.FDR.1, write, "outliers_lfmm2_6p9g9_red_Max_T.txt", append=TRUE))

July Temp Min

oys6p9g9_red_Min_T_Jul.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Min_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Min_T_Jul.pv <- lfmm_test(Y=gen6p9g9, X=Min_T_Jul, lfmm=oys6p9g9_red_Min_T_Jul.lfmm, calibrate="gif")

names(oys6p9g9_red_Min_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.06, which I'm happy with.
oys6p9g9_red_Min_T_Jul.pv$gif
## [1] 1.055777

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Min_T_Jul <- oys6p9g9_red_Min_T_Jul.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Min_T_Jul <- oys6p9g9_red_Min_T_Jul.pv$gif[1])       ## default GIF for this predictor
## [1] 1.055777
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Min_T_Jul <- pchisq(zscore_6p9g9_red_Min_T_Jul^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.06)")

hist(adj.pv6p9g9_red_Min_T_Jul, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Min_T_Jul.qv <- qvalue(oys6p9g9_red_Min_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Min_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_Min_T_Jul.FDR.1 <- which(oys6p9g9_red_Min_T_Jul.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Min_T_Jul.FDR.1
## [1] 12372 19400
invisible(lapply(oys6p9g9_red_Min_T_Jul.FDR.1, write, "outliers_lfmm2_6p9g9_red_Min_T_Jul.txt", append=TRUE))

July Temp Avg

oys6p9g9_red_Avg_T_Jul.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_T_Jul.pv <- lfmm_test(Y=gen6p9g9, X=Avg_T_Jul, lfmm=oys6p9g9_red_Avg_T_Jul.lfmm, calibrate="gif")

names(oys6p9g9_red_Avg_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.13
oys6p9g9_red_Avg_T_Jul.pv$gif
## [1] 1.131646

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_T_Jul <- oys6p9g9_red_Avg_T_Jul.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Avg_T_Jul <- oys6p9g9_red_Avg_T_Jul.pv$gif[1])       ## default GIF for this predictor
## [1] 1.131646
new.gif <- 1.00            ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_T_Jul <- pchisq(zscore_6p9g9_red_Avg_T_Jul^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.13)")

hist(adj.pv6p9g9_red_Avg_T_Jul, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_T_Jul.qv <- qvalue(oys6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 3
oys6p9g9_red_Avg_T_Jul.FDR.1 <- which(oys6p9g9_red_Avg_T_Jul.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_T_Jul.FDR.1
## [1] 13926 13927 20651
invisible(lapply(oys6p9g9_red_Avg_T_Jul.FDR.1, write, "outliers_lfmm2_6p9g9_red_Avg_T_Jul.txt", append=TRUE))

June Salinity Avg

oys6p9g9_red_Avg_Sal_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_Sal_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Avg_Sal_Jun, lfmm=oys6p9g9_red_Avg_Sal_Jun.lfmm, calibrate="gif")

names(oys6p9g9_red_Avg_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_Avg_Sal_Jun.pv$gif
## [1] 1.077006

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_Sal_Jun <- oys6p9g9_red_Avg_Sal_Jun.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Avg_Sal_Jun <- oys6p9g9_red_Avg_Sal_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.077006
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_Sal_Jun <- pchisq(zscore_6p9g9_red_Avg_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")

hist(adj.pv6p9g9_red_Avg_Sal_Jun, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_Sal_Jun.qv <- qvalue(oys6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_Avg_Sal_Jun.FDR.1 <- which(oys6p9g9_red_Avg_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_Sal_Jun.FDR.1
## [1] 10757 17899
invisible(lapply(oys6p9g9_red_Avg_Sal_Jun.FDR.1, write, "outliers_lfmm2_6p9g9_red_Avg_Sal_Jun.txt", append=TRUE))

June Salinity Min

oys6p9g9_red_Min_Sal_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Min_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Min_Sal_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Min_Sal_Jun, lfmm=oys6p9g9_red_Min_Sal_Jun.lfmm, calibrate="gif")

names(oys6p9g9_red_Min_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_Min_Sal_Jun.pv$gif
## [1] 1.077292

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Min_Sal_Jun <- oys6p9g9_red_Min_Sal_Jun.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Min_Sal_Jun <- oys6p9g9_red_Min_Sal_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.077292
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Min_Sal_Jun <- pchisq(zscore_6p9g9_red_Min_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")

hist(adj.pv6p9g9_red_Min_Sal_Jun, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Min_Sal_Jun.qv <- qvalue(oys6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Min_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Min_Sal_Jun.FDR.1 <- which(oys6p9g9_red_Min_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Min_Sal_Jun.FDR.1
## [1] 10757
invisible(lapply(oys6p9g9_red_Min_Sal_Jun.FDR.1, write, "outliers_lfmm2_6p9g9_red_Min_Sal_Jun.txt", append=TRUE))

pH Avg

oys6p9g9_red_Avg_pH.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_pH, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_pH.pv <- lfmm_test(Y=gen6p9g9, X=Avg_pH, lfmm=oys6p9g9_red_Avg_pH.lfmm, calibrate="gif")

names(oys6p9g9_red_Avg_pH.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys6p9g9_red_Avg_pH.pv$gif
## [1] 1.068287

An appropriately calibrated set of tests will have a GIF of around 1. The elevated GIF for our tests indicates that the results may be overly liberal in identifying candidate SNPs. If the GIF is less than one, the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_pH <- oys6p9g9_red_Avg_pH.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Avg_pH <- oys6p9g9_red_Avg_pH.pv$gif[1])       ## default GIF for this predictor
## [1] 1.068287
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_pH <- pchisq(zscore_6p9g9_red_Avg_pH^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")

hist(adj.pv6p9g9_red_Avg_pH, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_pH.qv <- qvalue(oys6p9g9_red_Avg_pH.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_pH.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Avg_pH.FDR.1 <- which(oys6p9g9_red_Avg_pH.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_pH.FDR.1
## [1] 16988
invisible(lapply(oys6p9g9_red_Avg_pH.FDR.1, write, "outliers_lfmm2_6p9g9_red_Avg_pH.txt", append=TRUE))
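Note that the indices written out here (e.g. 16988) are column positions in the genotype matrix, not locus names. A hypothetical sketch for translating them into SNP identifiers, assuming a `snp_ids` character vector of CHROM_POS names ordered like the columns of gen6p9g9 (e.g. built from the VCF's fixed fields; `snp_ids` and `map_outliers` are illustrative names, not objects from this workflow):

```r
# Hypothetical helper: translate genotype-matrix column indices into
# SNP identifiers; `snp_ids` must share the column order of gen6p9g9.
map_outliers <- function(idx, snp_ids) {
  data.frame(index = idx, snp = snp_ids[idx], row.names = NULL)
}

# e.g. map_outliers(oys6p9g9_red_Avg_pH.FDR.1, snp_ids)
```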

June DO Avg

oys6p9g9_red_Avg_DO_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_DO_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_DO_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Avg_DO_Jun, lfmm=oys6p9g9_red_Avg_DO_Jun.lfmm, calibrate="gif")

names(oys6p9g9_red_Avg_DO_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.12, which I'm happy with.
oys6p9g9_red_Avg_DO_Jun.pv$gif
## [1] 1.118573

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_DO_Jun <- oys6p9g9_red_Avg_DO_Jun.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_6p9g9_red_Avg_DO_Jun <- oys6p9g9_red_Avg_DO_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.118573
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_DO_Jun <- pchisq(zscore_6p9g9_red_Avg_DO_Jun^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.12)")

hist(adj.pv6p9g9_red_Avg_DO_Jun, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_DO_Jun.qv <- qvalue(oys6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_DO_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys6p9g9_red_Avg_DO_Jun.FDR.1 <- which(oys6p9g9_red_Avg_DO_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_DO_Jun.FDR.1
## integer(0)

PC1

oys6p9g9_red_PC1.lfmm <- lfmm_ridge(Y=gen6p9g9, X=PC1, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_PC1.pv <- lfmm_test(Y=gen6p9g9, X=PC1, lfmm=oys6p9g9_red_PC1.lfmm, calibrate="gif")

names(oys6p9g9_red_PC1.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.09, which I'm happy with.
oys6p9g9_red_PC1.pv$gif
## [1] 1.087458

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_PC1 <- oys6p9g9_red_PC1.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_6p9g9_red_PC1 <- oys6p9g9_red_PC1.pv$gif[1])       ## default GIF for this predictor
## [1] 1.087458
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_PC1 <- pchisq(zscore_6p9g9_red_PC1^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.09)")

hist(adj.pv6p9g9_red_PC1, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_PC1.qv <- qvalue(oys6p9g9_red_PC1.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_PC1.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_PC1.FDR.1 <- which(oys6p9g9_red_PC1.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_PC1.FDR.1
## [1] 12105 20651
invisible(lapply(oys6p9g9_red_PC1.FDR.1, write, "outliers_lfmm2_6p9g9_red_PC1.txt", append=TRUE))
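The `invisible(lapply(..., write, ..., append=TRUE))` idiom writes one index per line, but because of `append=TRUE` it duplicates lines if the chunk is re-run. A sketch of an equivalent that overwrites instead (`save_outliers` is an illustrative name, not part of the original workflow):

```r
# Equivalent to the lapply/write idiom above, but overwrites the file
# on re-runs instead of appending duplicate lines.
save_outliers <- function(idx, path) writeLines(as.character(idx), path)

# e.g. save_outliers(oys6p9g9_red_PC1.FDR.1, "outliers_lfmm2_6p9g9_red_PC1.txt")
```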

PC2

oys6p9g9_red_PC2.lfmm <- lfmm_ridge(Y=gen6p9g9, X=PC2, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_PC2.pv <- lfmm_test(Y=gen6p9g9, X=PC2, lfmm=oys6p9g9_red_PC2.lfmm, calibrate="gif")

names(oys6p9g9_red_PC2.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.05, which I'm happy with.
oys6p9g9_red_PC2.pv$gif
## [1] 1.052841

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_PC2 <- oys6p9g9_red_PC2.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_6p9g9_red_PC2 <- oys6p9g9_red_PC2.pv$gif[1])       ## default GIF for this predictor
## [1] 1.052841
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_PC2 <- pchisq(zscore_6p9g9_red_PC2^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.05)")

hist(adj.pv6p9g9_red_PC2, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_PC2.qv <- qvalue(oys6p9g9_red_PC2.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_PC2.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 4
oys6p9g9_red_PC2.FDR.1 <- which(oys6p9g9_red_PC2.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_PC2.FDR.1
## [1]  4960  5947 18936 19091
invisible(lapply(oys6p9g9_red_PC2.FDR.1, write, "outliers_lfmm2_6p9g9_red_PC2.txt", append=TRUE))

PC3

oys6p9g9_red_PC3.lfmm <- lfmm_ridge(Y=gen6p9g9, X=PC3, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_PC3.pv <- lfmm_test(Y=gen6p9g9, X=PC3, lfmm=oys6p9g9_red_PC3.lfmm, calibrate="gif")

names(oys6p9g9_red_PC3.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.02, which I'm happy with.
oys6p9g9_red_PC3.pv$gif
## [1] 1.022614

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_PC3 <- oys6p9g9_red_PC3.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_6p9g9_red_PC3 <- oys6p9g9_red_PC3.pv$gif[1])       ## default GIF for this predictor
## [1] 1.022614
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv6p9g9_red_PC3 <- pchisq(zscore_6p9g9_red_PC3^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.02)")

hist(adj.pv6p9g9_red_PC3, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_PC3.qv <- qvalue(oys6p9g9_red_PC3.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_PC3.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys6p9g9_red_PC3.FDR.1 <- which(oys6p9g9_red_PC3.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_PC3.FDR.1
## integer(0)
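The same steps (fit ridge LFMM, test, inspect the GIF, calibrate, threshold q-values) repeat for every predictor, and they are about to be repeated again on the noinvert dataset. They could be collapsed into a helper along these lines (a sketch with illustrative names, not code from the original workflow):

```r
# Sketch of a wrapper around the repeated per-predictor steps above.
# Returns the GIF and the indices of SNPs passing the FDR threshold.
lfmm_outlier_scan <- function(Y, X, K = 1, fdr = 0.1) {
  fit <- lfmm_ridge(Y = Y, X = X, K = K)
  pv  <- lfmm_test(Y = Y, X = X, lfmm = fit, calibrate = "gif")
  qv  <- qvalue(pv$calibrated.pvalue)$qvalues
  list(gif = pv$gif, outliers = which(qv < fdr))
}

# e.g. lfmm_outlier_scan(gen_noinvert_6p9g9, SE, K = 1)
```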

Repeating with the noinvert dataset

Run LFMM

K <- 1

Sewage Effluent

oys_noinvert_6p9g9_red_SE.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=SE, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_SE.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=SE, lfmm=oys_noinvert_6p9g9_red_SE.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_SE.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.09, which I'm happy with.
oys_noinvert_6p9g9_red_SE.pv$gif
## [1] 1.09136

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_noinvert_6p9g9_red_SE <- oys_noinvert_6p9g9_red_SE.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_noinvert_6p9g9_red_SE <- oys_noinvert_6p9g9_red_SE.pv$gif[1])       ## default GIF for this predictor
## [1] 1.09136
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_SE <- pchisq(zscore_noinvert_6p9g9_red_SE^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.09)")

hist(adj.pv_noinvert_6p9g9_red_SE, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_SE.qv <- qvalue(oys_noinvert_6p9g9_red_SE.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_SE.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_SE.FDR.1 <- which(oys_noinvert_6p9g9_red_SE.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_SE.FDR.1
## integer(0)

Temp Min

oys_noinvert_6p9g9_red_Min_T.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Min_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Min_T.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Min_T, lfmm=oys_noinvert_6p9g9_red_Min_T.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Min_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08.
oys_noinvert_6p9g9_red_Min_T.pv$gif
## [1] 1.084172

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_noinvert_6p9g9_red_Min_T <- oys_noinvert_6p9g9_red_Min_T.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_noinvert_6p9g9_red_Min_T <- oys_noinvert_6p9g9_red_Min_T.pv$gif[1])       ## default GIF for this predictor
## [1] 1.084172
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Min_T <- pchisq(zscore_noinvert_6p9g9_red_Min_T^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")

hist(adj.pv_noinvert_6p9g9_red_Min_T, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Min_T.qv <- qvalue(oys_noinvert_6p9g9_red_Min_T.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Min_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Min_T.FDR.1 <- which(oys_noinvert_6p9g9_red_Min_T.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Min_T.FDR.1
## [1] 12105
invisible(lapply(oys_noinvert_6p9g9_red_Min_T.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Min_T.txt", append=TRUE))

June Salinity Max

oys_noinvert_6p9g9_red_Max_Sal_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Max_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Max_Sal_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Max_Sal_Jun, lfmm=oys_noinvert_6p9g9_red_Max_Sal_Jun.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.13.
oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$gif
## [1] 1.128308

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_noinvert_6p9g9_red_Max_Sal_Jun <- oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_noinvert_6p9g9_red_Max_Sal_Jun <- oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.128308
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Max_Sal_Jun <- pchisq(zscore_noinvert_6p9g9_red_Max_Sal_Jun^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.13)")

hist(adj.pv_noinvert_6p9g9_red_Max_Sal_Jun, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Max_Sal_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Max_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys_noinvert_6p9g9_red_Max_Sal_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Max_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Max_Sal_Jun.FDR.1
## [1]  1364 12105
invisible(lapply(oys_noinvert_6p9g9_red_Max_Sal_Jun.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Max_Sal_Jun.txt", append=TRUE))
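SNP index 12105 is flagged here as well as for Temp Min (and for PC1 in the full dataset), so it is worth checking which candidates recur across predictors. A minimal sketch using the index vectors reported above:

```r
# SNPs flagged by every predictor in the list (index values copied
# from the FDR < 0.1 results above).
outlier_sets <- list(
  Min_T       = c(12105),
  Max_Sal_Jun = c(1364, 12105)
)
Reduce(intersect, outlier_sets)  # 12105
```

Swapping `intersect` for `union` gives the combined candidate list instead of the shared one.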

Temp Max

oys_noinvert_6p9g9_red_Max_T.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Max_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Max_T.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Max_T, lfmm=oys_noinvert_6p9g9_red_Max_T.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Max_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.09, which I'm happy with.
oys_noinvert_6p9g9_red_Max_T.pv$gif
## [1] 1.088141

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_noinvert_6p9g9_red_Max_T <- oys_noinvert_6p9g9_red_Max_T.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_noinvert_6p9g9_red_Max_T <- oys_noinvert_6p9g9_red_Max_T.pv$gif[1])       ## default GIF for this predictor
## [1] 1.088141
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Max_T <- pchisq(zscore_noinvert_6p9g9_red_Max_T^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.09)")

hist(adj.pv_noinvert_6p9g9_red_Max_T, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Max_T.qv <- qvalue(oys_noinvert_6p9g9_red_Max_T.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Max_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Max_T.FDR.1 <- which(oys_noinvert_6p9g9_red_Max_T.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Max_T.FDR.1
## [1] 18847
invisible(lapply(oys_noinvert_6p9g9_red_Max_T.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Max_T.txt", append=TRUE))

July Temp Min

oys_noinvert_6p9g9_red_Min_T_Jul.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Min_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Min_T_Jul.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Min_T_Jul, lfmm=oys_noinvert_6p9g9_red_Min_T_Jul.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Min_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys_noinvert_6p9g9_red_Min_T_Jul.pv$gif
## [1] 1.07285

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_noinvert_6p9g9_red_Min_T_Jul <- oys_noinvert_6p9g9_red_Min_T_Jul.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_noinvert_6p9g9_red_Min_T_Jul <- oys_noinvert_6p9g9_red_Min_T_Jul.pv$gif[1])       ## default GIF for this predictor
## [1] 1.07285
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Min_T_Jul <- pchisq(zscore_noinvert_6p9g9_red_Min_T_Jul^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")

hist(adj.pv_noinvert_6p9g9_red_Min_T_Jul, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Min_T_Jul.qv <- qvalue(oys_noinvert_6p9g9_red_Min_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Min_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys_noinvert_6p9g9_red_Min_T_Jul.FDR.1 <- which(oys_noinvert_6p9g9_red_Min_T_Jul.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Min_T_Jul.FDR.1
## [1] 12372 17596
invisible(lapply(oys_noinvert_6p9g9_red_Min_T_Jul.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Min_T_Jul.txt", append=TRUE))

July Temp Avg

oys_noinvert_6p9g9_red_Avg_T_Jul.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_T_Jul.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_T_Jul, lfmm=oys_noinvert_6p9g9_red_Avg_T_Jul.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Avg_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.13.
oys_noinvert_6p9g9_red_Avg_T_Jul.pv$gif
## [1] 1.134126

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore_noinvert_6p9g9_red_Avg_T_Jul <- oys_noinvert_6p9g9_red_Avg_T_Jul.pv$score[,1]   # z-scores for the first (and only) predictor
(gif_noinvert_6p9g9_red_Avg_T_Jul <- oys_noinvert_6p9g9_red_Avg_T_Jul.pv$gif[1])       ## default GIF for this predictor
## [1] 1.134126
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_T_Jul <- pchisq(zscore_noinvert_6p9g9_red_Avg_T_Jul^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.13)")

hist(adj.pv_noinvert_6p9g9_red_Avg_T_Jul, main="Readjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Avg_T_Jul.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 3
oys_noinvert_6p9g9_red_Avg_T_Jul.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_T_Jul.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_T_Jul.FDR.1
## [1] 13926 13927 18847
invisible(lapply(oys_noinvert_6p9g9_red_Avg_T_Jul.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Avg_T_Jul.txt", append=TRUE))
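Indices 13926 and 13927 are adjacent columns in the genotype matrix, which may reflect two tightly linked SNPs on the same locus rather than two independent signals. A hypothetical check, assuming a `positions` data.frame with CHROM and POS in matrix-column order (not an object from this workflow; it would have to be built from the VCF):

```r
# Hypothetical: test whether two outlier indices fall on the same contig.
# `positions` must have one row per genotype-matrix column, in order.
same_contig <- function(i, j, positions) positions$CHROM[i] == positions$CHROM[j]

# e.g. same_contig(13926, 13927, positions)
```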

June Salinity Avg

oys_noinvert_6p9g9_red_Avg_Sal_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_Sal_Jun, lfmm=oys_noinvert_6p9g9_red_Avg_Sal_Jun.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$gif
## [1] 1.070078

An appropriately calibrated set of tests will have a GIF of around one. A GIF above one, as here, indicates the results may be somewhat liberal in identifying candidate SNPs; a GIF below one indicates the test may be too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Avg_Sal_Jun <- oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif__noinvert_6p9g9_red_Avg_Sal_Jun <- oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.070078
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_Sal_Jun <- pchisq(zscore__noinvert_6p9g9_red_Avg_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")

hist(adj.pv_noinvert_6p9g9_red_Avg_Sal_Jun, main="Re-adjusted p-values (GIF=1.00)")
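The manual readjustment implemented by `pchisq(z^2/new.gif, df=1, lower=FALSE)` is just the chi-square (1 df) survival function applied to the rescaled squared z-scores: for each SNP j,

```latex
p_j = \Pr\!\left(\chi^2_{1} \ge \frac{z_j^{2}}{\lambda}\right)
```

where λ is the chosen GIF. Dividing by a smaller λ inflates the test statistic, so lowering λ from 1.07 to 1.00 yields smaller (less conservative) p-values; λ = 1 recovers the usual two-sided p-value for a standard-normal z-score.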

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Avg_Sal_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Avg_Sal_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_Sal_Jun.FDR.1
## [1] 10757
invisible(lapply(oys_noinvert_6p9g9_red_Avg_Sal_Jun.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Avg_Sal_Jun.txt", append=TRUE))

June Salinity Min

oys_noinvert_6p9g9_red_Min_Sal_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Min_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Min_Sal_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Min_Sal_Jun, lfmm=oys_noinvert_6p9g9_red_Min_Sal_Jun.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$gif
## [1] 1.069792

An appropriately calibrated set of tests will have a GIF of around 1. A GIF slightly above 1, as here, suggests the results may be slightly liberal in identifying candidate SNPs; a GIF below 1 would suggest the test is too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Min_Sal_Jun <- oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif__noinvert_6p9g9_red_Min_Sal_Jun <- oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.069792
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Min_Sal_Jun <- pchisq(zscore__noinvert_6p9g9_red_Min_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")

hist(adj.pv_noinvert_6p9g9_red_Min_Sal_Jun, main="Re-adjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Min_Sal_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Min_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Min_Sal_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Min_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Min_Sal_Jun.FDR.1
## [1] 10757
invisible(lapply(oys_noinvert_6p9g9_red_Min_Sal_Jun.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Min_Sal_Jun.txt", append=TRUE))

pH Avg

oys_noinvert_6p9g9_red_Avg_pH.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_pH, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_pH.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_pH, lfmm=oys_noinvert_6p9g9_red_Avg_pH.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Avg_pH.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.05, which I'm happy with.
oys_noinvert_6p9g9_red_Avg_pH.pv$gif
## [1] 1.054805

An appropriately calibrated set of tests will have a GIF of around 1. A GIF slightly above 1, as here, suggests the results may be slightly liberal in identifying candidate SNPs; a GIF below 1 would suggest the test is too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Avg_pH <- oys_noinvert_6p9g9_red_Avg_pH.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif__noinvert_6p9g9_red_Avg_pH <- oys_noinvert_6p9g9_red_Avg_pH.pv$gif[1])       ## default GIF for this predictor
## [1] 1.054805
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_pH <- pchisq(zscore__noinvert_6p9g9_red_Avg_pH^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.05)")

hist(adj.pv_noinvert_6p9g9_red_Avg_pH, main="Re-adjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Avg_pH.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_pH.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_pH.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_Avg_pH.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_pH.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_pH.FDR.1
## integer(0)
# invisible(lapply(oys_noinvert_6p9g9_red_Avg_pH.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Avg_pH.txt", append=TRUE))

June DO Avg

oys_noinvert_6p9g9_red_Avg_DO_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_DO_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_DO_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_DO_Jun, lfmm=oys_noinvert_6p9g9_red_Avg_DO_Jun.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.10, which I'm happy with.
oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$gif
## [1] 1.099362

An appropriately calibrated set of tests will have a GIF of around 1. A GIF slightly above 1, as here, suggests the results may be slightly liberal in identifying candidate SNPs; a GIF below 1 would suggest the test is too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Avg_DO_Jun <- oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif__noinvert_6p9g9_red_Avg_DO_Jun <- oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$gif[1])       ## default GIF for this predictor
## [1] 1.099362
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_DO_Jun <- pchisq(zscore__noinvert_6p9g9_red_Avg_DO_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.10)")

hist(adj.pv_noinvert_6p9g9_red_Avg_DO_Jun, main="Re-adjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Avg_DO_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_DO_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_Avg_DO_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_DO_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_DO_Jun.FDR.1
## integer(0)

PC1

oys_noinvert_6p9g9_red_PC1.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=PC1, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_PC1.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=PC1, lfmm=oys_noinvert_6p9g9_red_PC1.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_PC1.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.10, which I'm happy with.
oys_noinvert_6p9g9_red_PC1.pv$gif
## [1] 1.104819

An appropriately calibrated set of tests will have a GIF of around 1. A GIF slightly above 1, as here, suggests the results may be slightly liberal in identifying candidate SNPs; a GIF below 1 would suggest the test is too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_PC1 <- oys_noinvert_6p9g9_red_PC1.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif__noinvert_6p9g9_red_PC1 <- oys_noinvert_6p9g9_red_PC1.pv$gif[1])       ## default GIF for this predictor
## [1] 1.104819
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_PC1 <- pchisq(zscore__noinvert_6p9g9_red_PC1^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.10)")

hist(adj.pv_noinvert_6p9g9_red_PC1, main="Re-adjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_PC1.qv <- qvalue(oys_noinvert_6p9g9_red_PC1.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_PC1.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys_noinvert_6p9g9_red_PC1.FDR.1 <- which(oys_noinvert_6p9g9_red_PC1.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_PC1.FDR.1
## [1] 12105 18847
invisible(lapply(oys_noinvert_6p9g9_red_PC1.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_PC1.txt", append=TRUE))

PC2

oys_noinvert_6p9g9_red_PC2.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=PC2, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_PC2.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=PC2, lfmm=oys_noinvert_6p9g9_red_PC2.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_PC2.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.06, which I'm happy with.
oys_noinvert_6p9g9_red_PC2.pv$gif
## [1] 1.059911

An appropriately calibrated set of tests will have a GIF of around 1. A GIF slightly above 1, as here, suggests the results may be slightly liberal in identifying candidate SNPs; a GIF below 1 would suggest the test is too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_PC2 <- oys_noinvert_6p9g9_red_PC2.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif__noinvert_6p9g9_red_PC2 <- oys_noinvert_6p9g9_red_PC2.pv$gif[1])       ## default GIF for this predictor
## [1] 1.059911
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_PC2 <- pchisq(zscore__noinvert_6p9g9_red_PC2^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.06)")

hist(adj.pv_noinvert_6p9g9_red_PC2, main="Re-adjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_PC2.qv <- qvalue(oys_noinvert_6p9g9_red_PC2.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_PC2.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_PC2.FDR.1 <- which(oys_noinvert_6p9g9_red_PC2.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_PC2.FDR.1
## integer(0)
# invisible(lapply(oys_noinvert_6p9g9_red_PC2.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_PC2.txt", append=TRUE))

PC3

oys_noinvert_6p9g9_red_PC3.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=PC3, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_PC3.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=PC3, lfmm=oys_noinvert_6p9g9_red_PC3.lfmm, calibrate="gif")

names(oys_noinvert_6p9g9_red_PC3.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B"                 "epsilon.sigma2"    "B.sigma2"         
## [4] "score"             "pvalue"            "gif"              
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.03, which I'm happy with.
oys_noinvert_6p9g9_red_PC3.pv$gif
## [1] 1.0309

An appropriately calibrated set of tests will have a GIF of around 1. A GIF slightly above 1, as here, suggests the results may be slightly liberal in identifying candidate SNPs; a GIF below 1 would suggest the test is too conservative.

# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")

There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.

# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_PC3 <- oys_noinvert_6p9g9_red_PC3.pv$score[,1]   # zscores for first predictor, we only have one in our case...
(gif__noinvert_6p9g9_red_PC3 <- oys_noinvert_6p9g9_red_PC3.pv$gif[1])       ## default GIF for this predictor
## [1] 1.0309
new.gif <- 1.00             ## choose your new GIF

# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_PC3 <- pchisq(zscore__noinvert_6p9g9_red_PC3^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys_noinvert_6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")        

hist(oys_noinvert_6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.03)")

hist(adj.pv_noinvert_6p9g9_red_PC3, main="Re-adjusted p-values (GIF=1.00)")

#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_PC3.qv <- qvalue(oys_noinvert_6p9g9_red_PC3.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_PC3.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_PC3.FDR.1 <- which(oys_noinvert_6p9g9_red_PC3.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_PC3.FDR.1
## integer(0)

Prep outlier SNP files

Combine all LFMM2 outliers into one list:

In terminal:

This will save only unique outliers (no duplicates). Note that the glob outliers_lfmm2* also matches any outliers_lfmm2_noinvert* files present in the directory, so make sure only the intended files are there (or use a more specific pattern) to avoid mixing in the no-inversion SNP indices, which are numbered against a different VCF.

$ cat outliers_lfmm2* | sort | uniq > red_combined_unique_outliers_lfmm2_6p9g9.txt
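As a quick sanity check on this dedup step (with made-up SNP indices, not the real outlier files): `sort | uniq` collapses exact duplicate lines, but sorts lexically rather than numerically, which is why the combined lists below are not in numeric order.

```shell
# Toy demonstration of the dedup pipeline (fabricated SNP indices)
printf '13926\n10757\n13926\n1364\n' > a.txt
printf '10757\n12105\n' > b.txt

# Same pipeline as above: concatenate, sort, drop adjacent duplicates
cat a.txt b.txt | sort | uniq > combined.txt

cat combined.txt   # lexical order: "1364" sorts after "12105"
```

Use `sort -n` instead if a numerically ordered list is preferred; uniqueness is unaffected either way.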

To view:

$ cat red_combined_unique_outliers_lfmm2_6p9g9.txt  

    output:
    10757
    12105
    12372
    1364
    13926
    13927
    16988
    17899
    18936
    19091
    19400
    20651
    4960
    5947

No-inversion (noinvert) outliers

$ cat outliers_lfmm2_noinvert* | sort | uniq > red_combined_unique_outliers_lfmm2_no_invert_6p9g9.txt

To view:

$ cat red_combined_unique_outliers_lfmm2_no_invert_6p9g9.txt

    output:
    10757
    12105
    12372
    1364
    13926
    13927
    17596
    18847

Save outliers to new file

# Convert SNP indices to locus position and chromosome info
$ mawk '!/#/' SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf | cut -f1,2 > totalloci_6p9g9
$ NUM=(`cat totalloci_6p9g9 | wc -l`)
$ paste <(seq 1 $NUM) totalloci_6p9g9 > loci_6p9g9.plus.index
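To see what the index file looks like, here is a minimal sketch on a fabricated two-column table (hypothetical chromosome and position values): the same `seq`/`paste` trick prepends a 1-based SNP index matching the order sites appear in the VCF, and `grep -w '^<index>'` then recovers chromosome and position for a given outlier index, as done below with parallel.

```shell
# Fabricated stand-in for the totalloci file (chromosome<TAB>position per SNP)
printf 'NC_000001.1\t100\nNC_000001.1\t250\nNC_000002.1\t75\n' > totalloci_demo

# Prepend a 1-based index column (requires bash for the <( ) substitution)
NUM=$(wc -l < totalloci_demo)
paste <(seq 1 "$NUM") totalloci_demo > loci_demo.plus.index

# Look up outlier index 2; -w plus the ^ anchor matches the whole first field,
# so index 2 does not also match lines whose position starts with 2
grep -w '^2' loci_demo.plus.index | cut -f2,3
```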

# pcadapt outliers 
$ cat outliers_pcadapt_thinned500.txt | parallel "grep -w ^{} loci_6p9g9.plus.index" | cut -f2,3 > outlier_pcadapt_thinned500.loci.txt

# LFMM2_red outliers
$ cat red_combined_unique_outliers_lfmm2_6p9g9.txt | parallel "grep -w ^{} loci_6p9g9.plus.index" | cut -f2,3 > outlier_lfmm2_6p9g9_red.loci.txt

$ cat outlier_lfmm2_6p9g9_red.loci.txt

    output:
    NC_035783.1 10105379
    NC_035783.1 35005081
    NC_035783.1 42294461
    NC_035780.1 24512990
    NC_035784.1 8396484
    NC_035784.1 8396493
    NC_035784.1 65267874 #inside inversion
    NC_035784.1 79541695 #inside inversion
    NC_035785.1 36953129 #inside inversion
    NC_035785.1 41726211 #inside inversion
    NC_035786.1 15925374
    NC_035787.1 34532963
    NC_035781.1 31586020
    NC_035781.1 49023293

The four SNPs within the inversions are associated with Avg_pH, Avg_Sal_Jun, and PC2.

Repeating for vcf with no invert

# Convert SNP indices to locus position and chromosome info
$ mawk '!/#/' SNP.TRS6newdp10mafp9g9nDNAmaf052A_noinvert.recode.vcf | cut -f1,2 > totalloci_6p9g9_noinvert
$ NUM=(`cat totalloci_6p9g9_noinvert | wc -l`)
$ paste <(seq 1 $NUM) totalloci_6p9g9_noinvert > loci_6p9g9_noinvert.plus.index

# LFMM2_red_noinvert outliers
$ cat red_combined_unique_outliers_lfmm2_no_invert_6p9g9.txt | parallel "grep -w ^{} loci_6p9g9_noinvert.plus.index" | cut -f2,3 > outlier_lfmm2_6p9g9_red_noinvert.loci.txt

$ cat outlier_lfmm2_6p9g9_red_noinvert.loci.txt 

    output:
    NC_035783.1 10105379
    NC_035783.1 35005081
    NC_035783.1 42294461
    NC_035780.1 24512990
    NC_035784.1 8396484
    NC_035784.1 8396493
    NC_035786.1 15925374
    NC_035787.1 34532963

All 8 SNPs are shared with the full SNP data set. Of the 6 outliers unique to the full data set, 4 fall within the inversions; the remaining 2 are associated with PC2.

Combine all outlier loci into one file

To make this easier:

$ cat outlier_pcadapt_thinned500.loci.txt outlier_lfmm2_6p9g9_red.loci.txt  > all_6p9g9_red.outliers
$ cat all_6p9g9_red.outliers | sort | uniq | wc -l

    output:
    70

There are two SNPs shared between pcadapt and lfmm2; both are located in inversions:

    NC_035784.1 65267874
    NC_035785.1 41726211

Separating outlier and neutral loci into 2 VCF files

Create VCF file with just neutral loci

$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf --exclude-positions all_6p9g9_red.outliers --recode-INFO-all --out neutralloci6p9g9_red --recode

    output:
    After filtering, kept 60 out of 60 Individuals
    Outputting VCF file...
    After filtering, kept 22804 out of a possible 22874 Sites
    Run Time = 6.00 seconds

22804 neutral loci in neutralloci6p9g9_red.recode.vcf.
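The site lists passed to --positions / --exclude-positions are simply two tab-separated columns (chromosome, position), one site per line, which is exactly what the `cut -f2,3` step above produces. A minimal sketch (with fabricated chromosome names) that validates that layout before handing the file to vcftools:

```shell
# Fabricated positions file in the layout vcftools expects: CHROM<TAB>POS
printf 'NC_000001.1\t100\nNC_000002.1\t75\n' > outliers_demo.positions

# Check every line has exactly two tab-separated fields and a numeric position
awk -F'\t' 'NF != 2 || $2 !~ /^[0-9]+$/ { bad=1 } END { exit bad }' outliers_demo.positions \
  && echo "positions file OK"
```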

I’m going to make a second neutral loci vcf file with large inversion regions removed as they drive the population structure.

$ vcftools --vcf neutralloci6p9g9_red.recode.vcf --exclude-bed Detected_large_inversions.bed --recode-INFO-all --out neutralloci6p9g9_red_noinversion --recode

    output:
    After filtering, kept 60 out of 60 Individuals
    Outputting VCF file...
        Read 3 BED file entries.
    After filtering, kept 21038 out of a possible 22804 Sites
    Run Time = 6.00 seconds

Create VCF file with just outlier loci

$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf --recode --recode-INFO-all --positions all_6p9g9_red.outliers --out outlierloci6p9g9_red

    output:
    After filtering, kept 60 out of 60 Individuals
    Outputting VCF file...
    After filtering, kept 70 out of a possible 22874 Sites
    Run Time = 0.00 seconds

70 outlier loci in outlierloci6p9g9_red.recode.vcf.

I’m also creating a vcf file containing only the outliers unique to pcadapt (56 SNPs). This will be used for a redundancy analysis to identify associations with the distance-based (db) outliers.

$ vcftools --vcf outlierloci6p9g9_red.recode.vcf --recode --recode-INFO-all --exclude-positions outlier_lfmm2_6p9g9_red.loci.txt --out outlierloci6p9g9_uniq_db

    output:
    After filtering, kept 60 out of 60 Individuals
    Outputting VCF file...
    After filtering, kept 56 out of a possible 70 Sites
    Run Time = 0.00 seconds

56 loci in outlierloci6p9g9_uniq_db.recode.vcf.

Now, I’m taking the unique-to-pcadapt vcf file and splitting it into a set of SNPs inside the inversions and a set of SNPs outside the inversions.

# Inversions only
$ vcftools --vcf outlierloci6p9g9_uniq_db.recode.vcf --recode --recode-INFO-all --bed Detected_large_inversions.bed --out outliers6p9g9_uniq_db_invert

    output:
    After filtering, kept 60 out of 60 Individuals
    Outputting VCF file...
        Read 3 BED file entries.
    After filtering, kept 34 out of a possible 56 Sites
    Run Time = 0.00 seconds

# Inversions excluded
$ vcftools --vcf outlierloci6p9g9_uniq_db.recode.vcf --recode --recode-INFO-all --exclude-bed Detected_large_inversions.bed --out outliers6p9g9_uniq_db_noinvert

    output:
    After filtering, kept 60 out of 60 Individuals
    Outputting VCF file...
        Read 3 BED file entries.
    After filtering, kept 22 out of a possible 56 Sites
    Run Time = 0.00 seconds